Client Report - Late Flights & Missing Data (JSON)

Course DS 250

Author

[STUDENT NAME]

Show the code
import pandas as pd
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here
df = pd.read_json("https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json")

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

From this summary we can learn that November is the best time to travel. We can also learn that SLC has the least amount of delays by weather.

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__

In this question, I did a simple filtering and relpacement in order to cleanly show “NaN” using df.replace.

Show the code
missing_tokens = ["", " ", "NA", "N/A", "null", None]
df.replace(missing_tokens, np.nan, inplace=True)

for col in df.columns:
    if "num_of_" in col or "minutes_delayed" in col:
        df[col] = pd.to_numeric(df[col], errors="coerce")

df["month"] = df["month"].replace({"Febuary": "February"})

one_row_json = (
    df.sample(1)
      .replace({np.nan: "NaN"})
      .to_json(orient="records", indent=2)
)

one_row_json
'[\n  {\n    "airport_code":"IAD",\n    "airport_name":"Washington, DC: Washington Dulles International",\n    "month":"November",\n    "year":2012.0,\n    "num_of_flights_total":6207,\n    "num_of_delays_carrier":245.0,\n    "num_of_delays_late_aircraft":331,\n    "num_of_delays_nas":189,\n    "num_of_delays_security":1,\n    "num_of_delays_weather":10,\n    "num_of_delays_total":777,\n    "minutes_delayed_carrier":13652.0,\n    "minutes_delayed_late_aircraft":23025,\n    "minutes_delayed_nas":-999.0,\n    "minutes_delayed_security":16,\n    "minutes_delayed_weather":329,\n    "minutes_delayed_total":46556\n  }\n]'

QUESTION|TASK 2

Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

For this table, I decided to look into the avgerage delay hours by airport to be able to easily see which airport was doing the worst. I used lambda to do this.

Show the code
summary = (
    df.groupby("airport_code", dropna=True)
      .agg(
          total_flights   = ("num_of_flights_total", "sum"),
          avg_delay_mins  = ("minutes_delayed_total", "mean")
      )
      .assign(
          avg_delay_hrs = lambda d: d.avg_delay_mins / 60
      )
      .drop(columns="avg_delay_mins")
      .sort_values("avg_delay_hrs", ascending=False)
)

summary.head()
total_flights avg_delay_hrs
airport_code
ORD 3597588 7115.672854
ATL 4430047 6816.152273
SFO 1630945 3352.334975
DEN 2513974 3178.457197
IAD 851571 1298.418939

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

I did he same type of averaging idea for this question, but I averaged by month. It can clearly be seen that the best month to go is in November, due to the low amount of delays comparatively.

Show the code
# Drop rows with missing months
month_df = df.dropna(subset=["month"])

month_tbl = (
    month_df.groupby("month")
      .agg(avg_delay_mins=("minutes_delayed_total", "mean"))
      .reset_index()
)


month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]

month_tbl["month"] = pd.Categorical(month_tbl["month"], categories=month_order, ordered=True)
month_tbl["avg_delay_hrs"] = month_tbl["avg_delay_mins"] / 60
month_tbl = month_tbl.sort_values("month")

from lets_plot import *

ggplot(month_tbl) + \
    geom_bar(aes(x="month", y="avg_delay_hrs"), stat="identity", fill="#1f77b4") + \
    ggtitle("Average Delay Time by Month (in Hours)") + \
    xlab("Month") + ylab("Avg Delay (Hours)") + \
    theme(axis_text_x=element_text(angle=45, hjust=1)) + \
    ggsize(700, 350)

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

a. 100% of delayed flights in the Weather category are due to weather  
a. 30% of all delayed flights in the Late-Arriving category are due to weather  
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%    

This analysis is able to take a deeper look into possible different weather delays that are not just severe. From this, we can see that ATL has the most delays.

Show the code
# Include and execute your code here
# Make sure values are numeric
for col in ["num_of_delays_weather", "num_of_delays_late_aircraft", "num_of_delays_nas"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Fill missing late aircraft delays with column mean
df["num_of_delays_late_aircraft"].fillna(df["num_of_delays_late_aircraft"].mean(), inplace=True)
Show the code
# Include and execute your code here

def compute_weather(row):
    weather = row["num_of_delays_weather"] or 0
    late_aircraft = 0.30 * row["num_of_delays_late_aircraft"]
    

    spring_summer = ["April", "May", "June", "July", "August"]
    nas_pct = 0.40 if row["month"] in spring_summer else 0.65
    nas = nas_pct * (row["num_of_delays_nas"] or 0)
    
    return weather + late_aircraft + nas

df["weather_all"] = df.apply(compute_weather, axis=1)
Show the code
# Include and execute your code here
df[[
    "airport_code", "month",
    "num_of_delays_weather", "num_of_delays_late_aircraft", "num_of_delays_nas",
    "weather_all"
]].head()
airport_code month num_of_delays_weather num_of_delays_late_aircraft num_of_delays_nas weather_all
0 ATL January 448 -999 4598 3137.00
1 DEN January 233 928 935 1119.15
2 IAD January 61 1058 895 960.15
3 ORD January 306 2255 5415 4502.25
4 SAN January 56 680 638 674.70

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.

SFO has the most delays from weather that are mild and severe.

Show the code
# Include and execute your code here
df["weather_delay_prop"] = df["weather_all"] / df["num_of_flights_total"]

weather_by_airport = (
    df.groupby("airport_code", dropna=True)
      .agg(prop_weather_delay=("weather_delay_prop", "mean"))
      .reset_index()
      .sort_values("prop_weather_delay", ascending=False)
)

from lets_plot import *

(ggplot(weather_by_airport) +
 geom_bar(aes(x="airport_code",
              y="prop_weather_delay"),
          stat="identity",
          fill="#1f77b4") +
 ggtitle("Proportion of Flights Delayed by Weather (Severe + Mild)") +
 xlab("Airport") + ylab("Proportion delayed") +
 theme(axis_text_x=element_text(angle=45, hjust=1)) +
 ggsize(700, 400))

STRETCH QUESTION|TASK 1

Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.

type your results and analysis here

Show the code
# Include and execute your code here